Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present-day 3D repositories in scale, number of categories, and the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.
Training embodied agents in simulation has become mainstream for the embodied AI community. However, these agents often struggle when deployed in the physical world due to their inability to generalize to real-world environments. In this paper, we present Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment. The generated scenes are conditioned on the wall layout and arrangement of large objects from the scan, while also sampling lighting, clutter, surface textures, and instances of smaller objects with randomized placement and materials. Leveraging just a simple RGB camera, training with Phone2Proc shows massive improvements from 34.7% to 70.7% success rate in sim-to-real ObjectNav performance across a test suite of over 200 trials in diverse real-world environments, including homes, offices, and RoboTHOR. Furthermore, Phone2Proc's diverse distribution of generated scenes makes agents remarkably robust to changes in the real world, such as human movement, object rearrangement, lighting changes, or clutter.
Training effective embodied AI agents often involves manual reward engineering, expert imitation, specialized components such as maps, or leveraging additional sensors for depth and localization. Another approach is to use neural architectures alongside self-supervised objectives which encourage better representation learning. In practice, there are few guarantees that these self-supervised objectives encode task-relevant information. We propose the Scene Graph Contrastive (SGC) loss, which uses scene graphs as general-purpose, training-only, supervisory signals. The SGC loss does away with explicit graph decoding and instead uses contrastive learning to align an agent's representation with a rich graphical encoding of its environment. The SGC loss is generally applicable, simple to implement, and encourages representations that encode objects' semantics, relationships, and history. Using the SGC loss, we attain significant gains on three embodied tasks: Object Navigation, Multi-Object Navigation, and Arm Point Navigation. Finally, we present studies and analyses which demonstrate the ability of our trained representation to encode semantic cues about the environment.
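The contrastive alignment described in the abstract above can be illustrated with a generic InfoNCE-style objective. This is only a minimal sketch, not the authors' implementation: the function name `info_nce`, the temperature value, and the assumption that each agent embedding is paired with exactly one scene-graph embedding (with the other graph embeddings in the batch acting as negatives) are all hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(agent_embs, graph_embs, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Assumption: agent_embs[i] is the agent's representation and
    graph_embs[i] is the encoding of the matching scene graph; every
    other graph embedding in the batch serves as a negative.
    """
    agent = [normalize(a) for a in agent_embs]
    graph = [normalize(g) for g in graph_embs]
    total = 0.0
    for i, a in enumerate(agent):
        # Cosine similarity to every graph embedding, scaled by temperature.
        logits = [dot(a, g) / temperature for g in graph]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching graph as the correct class.
        total += log_sum_exp - logits[i]
    return total / len(agent)


# Matching pairs should score a lower loss than mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each agent representation toward the encoding of its own scene graph without ever decoding a graph, which is the property the abstract emphasizes.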
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
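Massive datasets and high-capacity models have driven many recent advances in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose ProcTHOR, a framework for procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks. We demonstrate the power and potential of ProcTHOR via a sample of 10,000 generated houses and a simple neural model. Models trained using only RGB images on ProcTHOR, with no explicit mapping and no human task supervision, produce state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation, including the presently running Habitat 2022, AI2-THOR Rearrangement 2022, and RoboTHOR challenges. We also show that pre-training on ProcTHOR, with no fine-tuning on the downstream benchmark, often beats previous state-of-the-art systems that have access to the downstream training data.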
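The last few years have witnessed substantial progress in the field of embodied AI, where artificial agents, mirroring their biological counterparts, are now able to learn from interaction to accomplish complex tasks. Despite this success, biological organisms still hold one large advantage over these simulated agents: adaptation. While both living and simulated agents make decisions to achieve goals (strategy), biological organisms have evolved to understand their environment (sensing) and respond to it (physiology). The net gain of these factors depends on the environment, and organisms have adapted accordingly. For example, in a low-vision aquatic environment, some fish have evolved specific neurons that offer a predictable but incredibly rapid strategy to escape from predators. Mammals have lost these reactive systems, but they have much larger fields of view and brain circuitry capable of understanding many future possibilities. While traditional embodied agents manipulate an environment to best achieve a goal, we argue for an introspective agent, which considers its own abilities in the context of its environment. We show that different environments yield vastly different optimal designs, and that increasing long-term planning is often far less beneficial than other improvements, such as increased physical ability. We present these findings to broaden the definition of improvement in embodied AI beyond increasingly complex models. Just as in nature, we hope to reframe strategy as one tool, among many, to succeed in an environment. Code is available at: https://github.com/sarahpratt/introspective.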
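Embodied AI has shown promising results on an abundance of robotic tasks in simulation, including visual navigation and manipulation. Prior work generally pursues high success rates with shortest paths while largely ignoring the problems caused by collisions during interaction. This lack of prioritization is understandable: in simulated environments, there is no inherent cost to breaking virtual objects. As a result, well-trained agents frequently have catastrophic collisions with objects despite eventual success. In the robotics community, where the cost of collision is large, collision avoidance is a long-standing and crucial topic for ensuring that robots can be safely deployed in the real world. In this work, we take the first step towards collision/disturbance-free embodied AI agents for visual mobile manipulation, facilitating safe deployment on real robots. We develop a new disturbance-avoidance methodology, at the heart of which is the auxiliary task of disturbance prediction. When combined with a disturbance penalty, our auxiliary task greatly improves sample efficiency and final performance by distilling knowledge of disturbance into the agent. Our experiments on ManipulaTHOR show that, on testing scenes with novel objects, our method improves the success rate from 61.7% to 85.6% and the success rate without disturbance from 29.8% to 50.2% over the original baseline. Extensive ablation studies show the value of our pipelined approach. The project site is at https://sites.google.com/view/disturb-free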
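Contrastive language-image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks, from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for Embodied AI tasks. We build incredibly simple baselines, named EmbCLIP, with no task-specific architectures, inductive biases (such as the use of semantic maps), auxiliary tasks during training, or depth maps, yet we find that our improved baselines perform very well across a range of tasks and simulators. EmbCLIP tops the RoboTHOR ObjectNav leaderboard by a huge margin of 20 points (Success Rate). It tops the iTHOR 1-Phase Rearrangement leaderboard, beating the next best submission, which employs Active Neural Mapping, and more than doubling the % Fixed Strict metric (0.08 to 0.17). It also beats the winners of the 2021 Habitat ObjectNav Challenge, which employ auxiliary tasks, depth maps, and human demonstrations, and those of the 2019 Habitat PointNav Challenge. We evaluate the ability of CLIP's visual representations to capture semantic information about input observations (primitives that are useful for navigation-heavy embodied tasks) and find that CLIP's representations encode these primitives more effectively than ImageNet-pretrained backbones. Finally, we extend one of our baselines, producing an agent capable of zero-shot object navigation that can navigate to objects that were not used as targets during training.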
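In practice, imitation learning is preferred over pure reinforcement learning whenever it is possible to design a teaching agent to provide expert supervision. However, we show that when the teaching agent makes decisions with access to privileged information that is unavailable to the student, this information is marginalized during imitation learning, resulting in an "imitation gap" and, potentially, poor results. Prior work bridges this gap via a progression from imitation learning to reinforcement learning. While often successful, gradual progression fails for tasks that require frequent switches between exploration and memorization. To better address these tasks and alleviate the imitation gap, we propose "Adaptive Insubordination" (ADVISOR). ADVISOR dynamically weights imitation and reward-based reinforcement learning losses during training, enabling on-the-fly switching between imitation and exploration. On a suite of challenging tasks set within gridworlds, multi-agent particle environments, and high-fidelity 3D simulators, we show that on-the-fly switching with ADVISOR outperforms pure imitation, pure reinforcement learning, as well as their sequential and parallel combinations.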
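The idea of dynamically weighting imitation and reward-based losses in the abstract above can be sketched as a per-step convex combination. The abstract does not specify how the weight is computed, so everything below is an illustrative assumption rather than the authors' method: the helper names, the exponential form of the weight, the `alpha` value, and the use of cross-entropy between the teacher's action distribution and an estimate of what the student can reproduce are all hypothetical.

```python
import math


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy H(target, predicted) between two action distributions."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def imitation_weight(teacher_probs, student_probs, alpha=4.0):
    """Hypothetical weight in (0, 1].

    Close to 1 when the student can already match the teacher (imitation
    is informative); close to 0 when it cannot, for example because the
    teacher relies on information the student never observes.
    """
    return math.exp(-alpha * cross_entropy(teacher_probs, student_probs))


def combined_loss(imitation_loss, rl_loss, weight):
    """Convex combination of the imitation and reinforcement learning losses."""
    return weight * imitation_loss + (1.0 - weight) * rl_loss


# Teacher is certain about action 0 in both states.
teacher = [1.0, 0.0, 0.0, 0.0]
# State where the student can follow the teacher: weight stays near 1.
w_match = imitation_weight(teacher, [0.97, 0.01, 0.01, 0.01])
# State where the student can only guess uniformly: weight drops near 0.
w_gap = imitation_weight(teacher, [0.25, 0.25, 0.25, 0.25])
```

Under these assumptions, states in which the teacher's choice is predictable from the student's observations are trained mostly by imitation, while states affected by the imitation gap fall back on the reward signal.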
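We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at http://ai2thor.allenai.org. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains, including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and to push research forward in this domain.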